Results 1 - 20 of 233
1.
Sci Data ; 11(1): 358, 2024 Apr 09.
Article in English | MEDLINE | ID: mdl-38594314

ABSTRACT

This paper presents a standardised dataset versioning framework for improved reusability, recognition, and data version tracking, facilitating comparisons and informed decision-making about data usability and workflow integration. The framework adopts a software-engineering-style versioning nomenclature ("major.minor.patch") and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_E,PCA and d_E,AE) based on unsupervised machine learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_E,PCA metric, which combines PCA models with splines: it is computationally efficient, produces values below 50 for new dataset batches whose changes are consistent with seasonal or trend variations, signals major updates (values of 100) when scaling transformations are applied to over 30% of variables, and handles information loss efficiently, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
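To make the PCA-based drift idea concrete, here is a minimal sketch of one way such a metric could be computed: fit PCA on a reference dataset version and score a new batch by its reconstruction error relative to the reference. This is our own illustration, not the authors' implementation; the function name and synthetic data are invented, and the paper's 0-100 scaling and spline component are not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_drift_score(reference: np.ndarray, new_batch: np.ndarray, n_components: int = 5) -> float:
    """Toy PCA-based drift score: mean reconstruction error of the new batch,
    normalised by the reference reconstruction error (values near 1 ~ no drift)."""
    scaler = StandardScaler().fit(reference)
    ref_scaled = scaler.transform(reference)
    new_scaled = scaler.transform(new_batch)

    pca = PCA(n_components=n_components).fit(ref_scaled)

    def recon_error(x):
        recon = pca.inverse_transform(pca.transform(x))
        return np.mean((x - recon) ** 2)

    return recon_error(new_scaled) / (recon_error(ref_scaled) + 1e-12)

# Example usage with synthetic data
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 10))
shifted = reference * 1.5 + 0.3                      # simulated scaling transformation
print(pca_drift_score(reference, reference[:200]))   # ~1: no drift
print(pca_drift_score(reference, shifted[:200]))     # >1: drift detected
```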


Subject(s)
Datasets as Topic , Software , Principal Component Analysis , Reproducibility of Results , Workflow , Datasets as Topic/standards , Machine Learning
2.
Sci Data ; 10(1): 99, 2023 02 23.
Article in English | MEDLINE | ID: mdl-36823157

ABSTRACT

Biomedical datasets are increasing in size, are stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt open science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve the FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We addressed this gap by creating a reusable metadata schema based on Schema.org and catalogued nearly 400 datasets and computational tools we had collected. The approach is easily reusable to create schemas that are interoperable with community standards yet customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
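For orientation, this is the kind of Schema.org `Dataset` markup such a metadata schema builds on, expressed here as a Python dictionary. The field values and identifiers are illustrative placeholders; this is not the consortium's actual schema, only standard Schema.org types and properties.

```python
import json

# Minimal Schema.org Dataset record (illustrative values only; not the
# consortium's actual metadata schema).
dataset_record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example infectious disease surveillance dataset",
    "description": "De-identified case counts collected by a research center.",
    "identifier": "https://example.org/dataset/123",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["infectious disease", "surveillance", "FAIR"],
    "creator": {"@type": "Organization", "name": "Example Research Center"},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/cases.csv",
    },
}

print(json.dumps(dataset_record, indent=2))
```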


Subject(s)
Communicable Diseases , Datasets as Topic , Metadata , Reproducibility of Results , Datasets as Topic/standards , Humans
3.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 2782-2800, 2023 03.
Article in English | MEDLINE | ID: mdl-35560102

ABSTRACT

Micro-expression (ME) is a significant non-verbal communication cue that reveals a person's genuine emotional state. Micro-expression analysis (MEA) has gained attention only in the last decade. However, the small-sample-size problem constrains the use of deep learning for MEA. In addition, ME samples are distributed across six different databases, leading to database bias, and developing an ME database is complicated. In this article, we introduce a large-scale spontaneous ME database: CAS(ME)³. The contributions of this article are summarized as follows: (1) CAS(ME)³ offers around 80 hours of video with over 8,000,000 frames, including 1,109 manually labeled MEs and 3,490 macro-expressions. Such a large sample size allows effective validation of MEA methods while avoiding database bias. (2) Inspired by psychological experiments, CAS(ME)³ provides, for the first time, depth information as an additional modality, contributing to multi-modal MEA. (3) For the first time, CAS(ME)³ elicits ME with high ecological validity using the mock crime paradigm, along with physiological and voice signals, contributing to practical MEA. (4) In addition, CAS(ME)³ provides 1,508 unlabeled videos with more than 4,000,000 frames, i.e., a data platform for unsupervised MEA methods. (5) Finally, we demonstrate the effectiveness of depth information through the proposed depth flow algorithm and RGB-D information.


Subject(s)
Databases, Factual , Emotions , Facial Expression , Female , Humans , Male , Young Adult , Algorithms , Bias , Databases, Factual/standards , Datasets as Topic/standards , Photic Stimulation , Reproducibility of Results , Sample Size , Supervised Machine Learning/standards , Video Recording , Visual Perception
4.
Int J Neural Syst ; 32(9): 2250043, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35912583

ABSTRACT

A practical problem in supervised deep learning for medical image segmentation is the lack of labeled data, which is expensive and time-consuming to acquire; in contrast, a considerable amount of unlabeled data is available in the clinic. To make better use of the unlabeled data and improve generalization from limited labeled data, this paper presents a novel semi-supervised segmentation method based on multi-task curriculum learning. Here, curriculum learning means that, when training the network, simpler knowledge is learned first to assist the learning of more difficult knowledge. Concretely, our framework consists of a main segmentation task and two auxiliary tasks, i.e., a feature regression task and a target detection task. The two auxiliary tasks predict relatively simple image-level attributes and bounding boxes as pseudo labels for the main segmentation task, enforcing that the pixel-level segmentation result matches the distribution of these pseudo labels. In addition, to address class imbalance in the images, a bounding-box-based attention (BBA) module is embedded, enabling the segmentation network to focus more on the target region than on the background. Furthermore, to alleviate the adverse effects of possible deviations in the pseudo labels, error-tolerance mechanisms are adopted in the auxiliary tasks, including an inequality constraint and bounding-box amplification. Our method is validated on the ACDC2017 and PROMISE12 datasets. Experimental results demonstrate that, compared with the fully supervised method and state-of-the-art semi-supervised methods, our method yields much better segmentation performance on a small labeled dataset. Code is available at https://github.com/DeepMedLab/MTCL.
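The general idea behind bounding-box attention can be sketched as masking feature maps so that the network emphasises the box region over the background. The snippet below is a generic illustration with invented names and toy tensors, not the paper's BBA module.

```python
import torch

def box_attention_mask(feature_map: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """Toy bounding-box attention: build a binary mask from (x1, y1, x2, y2)
    boxes (in feature-map coordinates) and emphasise features inside the box.
    Generic illustration only, not the paper's BBA module."""
    b, c, h, w = feature_map.shape
    mask = torch.zeros(b, 1, h, w, device=feature_map.device)
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        mask[i, :, y1:y2, x1:x2] = 1.0
    # Soften the mask so background features are attenuated rather than removed.
    return feature_map * (0.1 + 0.9 * mask)

features = torch.randn(2, 16, 32, 32)
boxes = torch.tensor([[4, 4, 20, 20], [8, 2, 30, 16]])
attended = box_attention_mask(features, boxes)
print(attended.shape)  # torch.Size([2, 16, 32, 32])
```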


Subject(s)
Curriculum , Supervised Machine Learning , Data Curation/methods , Data Curation/standards , Datasets as Topic/standards , Datasets as Topic/supply & distribution , Image Processing, Computer-Assisted/methods , Supervised Machine Learning/classification , Supervised Machine Learning/statistics & numerical data , Supervised Machine Learning/trends
5.
Sci Rep ; 12(1): 14626, 2022 08 26.
Article in English | MEDLINE | ID: mdl-36028547

ABSTRACT

Polyp segmentation has achieved considerable success with supervised learning. However, obtaining large numbers of labeled images is commonly challenging in the medical domain. To address this problem, we employ semi-supervised methods and take advantage of unlabeled data to improve the performance of polyp image segmentation. First, we propose an encoder-decoder-based method well suited to polyps of varying shape, size, and scale. Second, we adopt a teacher-student training scheme in which the teacher model is an exponential moving average of the student model. Third, to leverage the unlabeled dataset, we enforce a consistency constraint, forcing the teacher model to generate similar outputs for different perturbed versions of a given input. Finally, we propose a method that improves on traditional pseudo-labeling by continuously updating the pseudo-labels as the model learns. We show the efficacy of our proposed method on several polyp datasets, attaining better results in semi-supervised settings. Extensive experiments demonstrate that our proposed method can propagate the unlabeled dataset's essential information to improve performance.
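The exponential-moving-average teacher and perturbation-consistency ingredients described here are standard; a minimal sketch follows, with toy convolutional models standing in for the encoder-decoder network and all names being our own, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, alpha: float = 0.99):
    """Teacher weights are an exponential moving average of the student's."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

def consistency_loss(student, teacher, unlabeled: torch.Tensor) -> torch.Tensor:
    """Encourage similar predictions on two perturbed views of the same input."""
    noisy_a = unlabeled + 0.1 * torch.randn_like(unlabeled)
    noisy_b = unlabeled + 0.1 * torch.randn_like(unlabeled)
    with torch.no_grad():
        target = torch.sigmoid(teacher(noisy_a))
    return F.mse_loss(torch.sigmoid(student(noisy_b)), target)

# Toy models standing in for the encoder-decoder segmentation network.
student = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
teacher = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
teacher.load_state_dict(student.state_dict())

images = torch.randn(4, 3, 64, 64)       # unlabeled batch
loss = consistency_loss(student, teacher, images)
loss.backward()
update_teacher(student, teacher)
print(float(loss))
```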


Subject(s)
Polyps/pathology , Supervised Machine Learning , Datasets as Topic/standards , Datasets as Topic/trends , Humans , Image Processing, Computer-Assisted , Polyps/diagnostic imaging
6.
AMIA Annu Symp Proc ; 2022: 662-671, 2022.
Article in English | MEDLINE | ID: mdl-37128396

ABSTRACT

Previous work on clinical relation extraction from free-text sentences has leveraged information about semantic types from clinical knowledge bases as part of the entity representations. In this paper, we exploit additional evidence by also making use of domain-specific semantic type dependencies. We encode the relation between a span of tokens matching a Unified Medical Language System (UMLS) concept and the other tokens in the sentence. We implement our method and compare it against different named entity recognition (NER) architectures (i.e., BiLSTM-CRF and BiLSTM-GCN-CRF) using different pre-trained clinical embeddings (i.e., BERT, BioBERT, UMLSBert). Our experimental results on clinical datasets show that in some cases NER effectiveness can be significantly improved by making use of domain-specific semantic type dependencies. Ours is also the first study to generate a matrix encoding that makes use of more than three dependencies in one pass for the NER task.
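As a rough picture of what a span-versus-other-tokens matrix encoding can look like, here is a generic illustration. The example sentence, the hypothetical UMLS match, and the encoding scheme are ours; the paper's actual matrix construction is not reproduced.

```python
import numpy as np

def semantic_type_matrix(tokens, concept_spans):
    """Toy matrix encoding: entry (i, j) is 1 when token i lies inside a span
    matched to a UMLS concept and token j is any other token in the sentence.
    Generic illustration only, not the paper's exact encoding scheme."""
    n = len(tokens)
    matrix = np.zeros((n, n), dtype=int)
    for start, end, semantic_type in concept_spans:   # end is exclusive
        for i in range(start, end):
            for j in range(n):
                if j < start or j >= end:
                    matrix[i, j] = 1
    return matrix

tokens = ["Patient", "denies", "chest", "pain", "today"]
# Hypothetical UMLS match: "chest pain" -> semantic type "Sign or Symptom"
spans = [(2, 4, "Sign or Symptom")]
print(semantic_type_matrix(tokens, spans))
```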


Subject(s)
Natural Language Processing , Semantics , Unified Medical Language System , Humans , Knowledge Bases , Datasets as Topic/standards , Sample Size , Reproducibility of Results
7.
Alzheimers Dement ; 18(1): 29-42, 2022 01.
Article in English | MEDLINE | ID: mdl-33984176

ABSTRACT

INTRODUCTION: Harmonized neuropsychological assessment for neurocognitive disorders, an international priority for valid and reliable diagnostic procedures, has been achieved only in specific countries or research contexts. METHODS: To harmonize the assessment of mild cognitive impairment in Europe, a workshop (Geneva, May 2018) convened stakeholders, methodologists, academic and non-academic clinicians, and experts from European, US, and Australian harmonization initiatives. RESULTS: Through formal presentations and thematic working groups, we defined a standard battery consistent with the U.S. Uniform DataSet, version 3, and a homogeneous methodology for obtaining consistent normative data across tests and languages. Adaptations consist of including two tests specific to typical Alzheimer's disease and behavioral variant frontotemporal dementia. The methodology for harmonized normative data includes a consensus definition of cognitively normal controls, classification of confounding factors (age, sex, and education), and calculation of minimum sample sizes. DISCUSSION: This expert consensus allows harmonizing the diagnosis of neurocognitive disorders across European countries and possibly beyond.


Subject(s)
Cognitive Dysfunction , Consensus Development Conferences as Topic , Datasets as Topic/standards , Neuropsychological Tests/standards , Age Factors , Cognition , Cognitive Dysfunction/classification , Cognitive Dysfunction/diagnosis , Educational Status , Europe , Expert Testimony , Humans , Language , Sex Factors
8.
Ann Surg ; 275(3): e549-e561, 2022 03 01.
Article in English | MEDLINE | ID: mdl-34238814

ABSTRACT

OBJECTIVE: The aim of this study is to describe a new international dataset for pathology reporting of colorectal cancer surgical specimens, produced under the auspices of the International Collaboration on Cancer Reporting (ICCR). BACKGROUND: The quality of pathology reporting and mutual understanding between the colorectal surgeon, pathologist, and oncologist are vital to patient management. Some pathology parameters are prone to variable interpretation, resulting in differing positions adopted by existing national datasets. METHODS: The ICCR, a global alliance of major pathology institutions with links to international cancer organizations, has developed and ratified a rigorous and efficient process for the development of evidence-based, structured datasets for pathology reporting of common cancers. Here we describe the production of a dataset for colorectal cancer resection specimens by a multidisciplinary panel of internationally recognized experts. RESULTS: The agreed dataset comprises eighteen core (essential) and seven non-core (recommended) elements identified from a review of current evidence. Areas of contention, some highly relevant to surgical practice, are addressed with the aim of standardizing multidisciplinary discussion. The summation of all core elements is considered the minimum reporting standard for individual cases. Commentary is provided explaining each element's clinical relevance, the definitions to be applied where appropriate for the agreed list of value options, and the rationale for considering the element core or non-core. CONCLUSIONS: This first internationally agreed dataset for colorectal cancer pathology reporting promotes standardization of pathology reporting and enhanced clinicopathological communication. Widespread adoption will facilitate international comparisons and multinational clinical trials and help to improve the management of colorectal cancer globally.


Subject(s)
Colorectal Neoplasms/pathology , Datasets as Topic/standards , Research Design , Humans
9.
Nat Biotechnol ; 40(1): 121-130, 2022 01.
Article in English | MEDLINE | ID: mdl-34462589

ABSTRACT

Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.
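The "architectural surgery" idea can be pictured as freezing a reference model's shared weights and training only newly added, query-specific conditional weights. The sketch below is a conceptual stand-in with invented class and function names; it is not the scArches API or the scvi-tools interface.

```python
import torch

class ConditionalEncoder(torch.nn.Module):
    """Toy conditional encoder: expression profile + batch embedding -> latent.
    A conceptual stand-in for a reference atlas model, not the scArches API."""
    def __init__(self, n_genes: int, n_batches: int, latent_dim: int = 10):
        super().__init__()
        self.batch_embedding = torch.nn.Embedding(n_batches, 16)
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_genes + 16, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, latent_dim),
        )

    def forward(self, x, batch_idx):
        return self.net(torch.cat([x, self.batch_embedding(batch_idx)], dim=1))

def add_query_batches(model: ConditionalEncoder, n_new_batches: int):
    """'Architectural surgery' in spirit: freeze the shared weights, then append
    embeddings for the new query batches. For simplicity the whole new embedding
    table stays trainable here; the actual method trains only the newly added
    conditional weights."""
    for p in model.parameters():
        p.requires_grad = False
    old = model.batch_embedding
    new = torch.nn.Embedding(old.num_embeddings + n_new_batches, old.embedding_dim)
    new.weight.data[: old.num_embeddings] = old.weight.data
    model.batch_embedding = new
    return [new.weight]  # parameters to hand to the optimizer

model = ConditionalEncoder(n_genes=2000, n_batches=3)
trainable = add_query_batches(model, n_new_batches=1)
print(sum(p.numel() for p in trainable))
```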


Subject(s)
Datasets as Topic/standards , Deep Learning , Organ Specificity , Single-Cell Analysis/standards , Animals , COVID-19/pathology , Humans , Mice , Reference Standards , SARS-CoV-2/pathogenicity
10.
Nature ; 600(7890): 695-700, 2021 12.
Article in English | MEDLINE | ID: mdl-34880504

ABSTRACT

Surveys are a crucial tool for understanding public opinion and behaviour, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the effect of survey bias: an instance of the Big Data Paradox [1]. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults from 9 January to 19 May 2021 from two large surveys: Delphi-Facebook [2,3] (about 250,000 responses per week) and Census Household Pulse [4] (about 75,000 every two weeks). In May 2021, Delphi-Facebook overestimated uptake by 17 percentage points (14-20 percentage points with 5% benchmark imprecision) and Census Household Pulse by 14 (11-17 percentage points with 5% benchmark imprecision), compared with a retroactively updated benchmark the Centers for Disease Control and Prevention published on 26 May 2021. Moreover, their large sample sizes led to minuscule margins of error on the incorrect estimates. By contrast, an Axios-Ipsos online panel [5] with about 1,000 responses per week following survey research best practices [6] provided reliable estimates and uncertainty quantification. We decompose observed error using a recent analytic framework [1] to explain the inaccuracy in the three surveys. We then analyse the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters more than data quantity, and that compensating for the former with the latter is a mathematically provable losing proposition.
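For reference, the "recent analytic framework [1]" is Meng's (2018) error decomposition; the transcription below is ours (notation follows that paper, not this abstract) and shows why a biased sample of 250,000 can be worth about as much as a tiny random sample.

```latex
% Meng (2018): error of a sample mean \bar{Y}_n (n respondents out of a
% population of size N with mean \bar{Y}_N and standard deviation \sigma_Y):
\[
  \bar{Y}_n - \bar{Y}_N \;=\; \hat{\rho}_{R,Y} \times \sqrt{\frac{1-f}{f}} \times \sigma_Y,
  \qquad f = \frac{n}{N},
\]
% where \hat{\rho}_{R,Y} is the data defect correlation between response and
% outcome. Equating the implied mean squared error with that of a simple random
% sample gives an effective sample size
\[
  n_{\mathrm{eff}} \;\approx\; \frac{f}{(1-f)\,E\!\left[\hat{\rho}_{R,Y}^{\,2}\right]},
\]
% which can be tiny even when n is in the hundreds of thousands.
```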


Subject(s)
COVID-19 Vaccines/administration & dosage , Health Care Surveys , Vaccination/statistics & numerical data , Benchmarking , Bias , Big Data , COVID-19/epidemiology , COVID-19/prevention & control , Centers for Disease Control and Prevention, U.S. , Datasets as Topic/standards , Female , Health Care Surveys/standards , Humans , Male , Research Design , Sample Size , Social Media , United States/epidemiology , Vaccination Hesitancy/statistics & numerical data
11.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 2413-2418, 2021 11.
Article in English | MEDLINE | ID: mdl-34891768

ABSTRACT

As neuroimaging datasets continue to grow in size, the complexity of data analyses can require a detailed understanding and implementation of systems computer science for storage, access, processing, and sharing. Currently, several general data standards (e.g., Zarr, HDF5, precomputed) and purpose-built ecosystems (e.g., BossDB, CloudVolume, DVID, and Knossos) exist. Each of these systems has advantages and limitations and is most appropriate for different use cases. Using datasets that do not fit into RAM in this heterogeneous environment is challenging, and significant barriers exist to leveraging underlying research investments. In this manuscript, we outline our perspective on how to approach this challenge through the use of community-provided, standardized interfaces that unify various computational backends and abstract computer science challenges away from the scientist. We introduce desirable design patterns and share our reference implementation, called intern.
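The design pattern advocated here — analysis code written against a standardized interface rather than a specific storage backend — can be sketched as follows. This is a generic illustration with names of our own choosing; it is not the intern API or any of the listed ecosystems.

```python
from abc import ABC, abstractmethod
import numpy as np

class VolumeBackend(ABC):
    """Generic interface for chunked volumetric data stores (a design-pattern
    sketch in the spirit of the paper; not the actual `intern` API)."""
    @abstractmethod
    def read(self, z_slice: slice, y_slice: slice, x_slice: slice) -> np.ndarray: ...

class InMemoryBackend(VolumeBackend):
    def __init__(self, volume: np.ndarray):
        self._volume = volume

    def read(self, z_slice, y_slice, x_slice) -> np.ndarray:
        return self._volume[z_slice, y_slice, x_slice]

def mean_intensity(backend: VolumeBackend, z, y, x) -> float:
    """Analysis code targets the interface, so a cloud-backed implementation
    (e.g., one wrapping BossDB or Zarr) could replace the in-memory one
    without changing this function."""
    return float(backend.read(z, y, x).mean())

volume = np.random.rand(64, 512, 512)
backend = InMemoryBackend(volume)
print(mean_intensity(backend, slice(0, 8), slice(0, 128), slice(0, 128)))
```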


Subject(s)
Datasets as Topic/standards , Neurosciences
13.
Genes (Basel) ; 12(10)2021 09 28.
Article in English | MEDLINE | ID: mdl-34680918

ABSTRACT

Gene set analysis has been widely used to gain insight from high-throughput expression studies. Although various tools and methods have been developed for gene set analysis, there is no consensus among researchers regarding best practice(s). Most often, evaluation studies have reported contradictory recommendations of which methods are superior. Therefore, an unbiased quantitative framework for evaluating gene set analysis methods would be valuable. Such a framework requires gene expression datasets where the enrichment status of gene sets is known a priori. In the absence of such gold standard datasets, artificial datasets are commonly used to evaluate gene set analysis methods; however, they often rely on oversimplifying assumptions that make them biased in favor of or against a given method. In this paper, we propose a quantitative framework for evaluation of gene set analysis methods by synthesizing expression datasets using real data, without relying on oversimplifying or unrealistic assumptions, while preserving complex gene-gene correlations and retaining the distribution of expression values. The utility of the quantitative approach is shown by evaluating ten widely used gene set analysis methods. An implementation of the proposed framework, Silver, is publicly available. We suggest using Silver to evaluate existing and new gene set analysis methods. Evaluation using Silver provides a better understanding of current methods and can aid in the development of gene set analysis methods that achieve higher specificity without sacrificing sensitivity.


Subject(s)
Databases, Genetic/standards , Genomics/methods , Software , Datasets as Topic/standards
14.
PLoS One ; 16(8): e0255754, 2021.
Article in English | MEDLINE | ID: mdl-34352030

ABSTRACT

Given multiple source datasets with labels, how can we train a model for a target dataset with no labeled data? Multi-source domain adaptation (MSDA) aims to train a model using multiple source datasets that differ from a target dataset in the absence of target data labels. MSDA is a crucial problem applicable to many practical cases where labels for the target data are unavailable due to privacy issues. Existing MSDA frameworks are limited because they align data without considering the class labels associated with the features of each domain, do not fully utilize the unlabeled target data, and rely on limited feature extraction with a single extractor. In this paper, we propose Multi-EPL, a novel method for MSDA. Multi-EPL exploits label-wise moment matching to align the conditional distributions of the features for each label, uses pseudolabels for the unavailable target labels, and introduces an ensemble of multiple feature extractors for accurate domain adaptation. Extensive experiments show that Multi-EPL provides state-of-the-art performance for MSDA tasks in both image and text domains, improving accuracy by up to 13.20%.
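Label-wise moment matching can be illustrated in simplified form as aligning class-conditional feature means between a source domain and pseudo-labelled target features. The sketch below is ours: the function name and toy tensors are invented, and the ensemble of extractors and higher-order moments from the paper are omitted.

```python
import torch

def labelwise_moment_loss(src_feat, src_labels, tgt_feat, tgt_pseudo, n_classes: int):
    """Match class-conditional feature means between source and (pseudo-labelled)
    target features. A simplified illustration of label-wise moment matching."""
    loss = src_feat.new_zeros(())
    for c in range(n_classes):
        src_c = src_feat[src_labels == c]
        tgt_c = tgt_feat[tgt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue  # class absent from one of the batches
        loss = loss + (src_c.mean(dim=0) - tgt_c.mean(dim=0)).pow(2).sum()
    return loss / n_classes

src_feat = torch.randn(32, 64)
src_labels = torch.randint(0, 4, (32,))
tgt_feat = torch.randn(40, 64)
tgt_pseudo = torch.randint(0, 4, (40,))   # pseudo-labels from the current model
print(float(labelwise_moment_loss(src_feat, src_labels, tgt_feat, tgt_pseudo, 4)))
```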


Subject(s)
Database Management Systems/standards , Deep Learning , Datasets as Topic/standards
16.
Nature ; 596(7873): 590-596, 2021 08.
Article in English | MEDLINE | ID: mdl-34293799

ABSTRACT

Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure [1]. Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold2, at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.


Subject(s)
Computational Biology/standards , Deep Learning/standards , Models, Molecular , Protein Conformation , Proteome/chemistry , Datasets as Topic/standards , Diacylglycerol O-Acyltransferase/chemistry , Glucose-6-Phosphatase/chemistry , Humans , Membrane Proteins/chemistry , Protein Folding , Reproducibility of Results
17.
J Clin Epidemiol ; 136: 136-145, 2021 08.
Article in English | MEDLINE | ID: mdl-33932483

ABSTRACT

BACKGROUND: Probabilistic linkage can link patients from different clinical databases without the need for personal information. If accurate linkage can be achieved, it would accelerate the use of linked datasets to address important clinical and public health questions. OBJECTIVE: We developed a step-by-step process for probabilistic linkage of national clinical and administrative datasets without personal information, and validated it against deterministic linkage using patient identifiers. STUDY DESIGN AND SETTING: We used electronic health records from the National Bowel Cancer Audit and Hospital Episode Statistics databases for 10,566 bowel cancer patients undergoing emergency surgery in the English National Health Service. RESULTS: Probabilistic linkage linked 81.4% of National Bowel Cancer Audit records to Hospital Episode Statistics, vs. 82.8% using deterministic linkage. No systematic differences were seen between patients that were and were not linked, and regression models for mortality and length of hospital stay according to patient and tumour characteristics were not sensitive to the linkage approach. CONCLUSION: Probabilistic linkage was successful in linking national clinical and administrative datasets for patients undergoing a major surgical procedure. It allows analysts outside highly secure data environments to undertake linkage while minimizing costs and delays, protecting data security, and maintaining linkage quality.
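Probabilistic linkage of this kind typically scores candidate record pairs with Fellegi-Sunter-style match weights. The toy below illustrates that standard calculation only; the m- and u-probabilities and field names are made up, and the paper's specific step-by-step process is not reproduced.

```python
import math

# Hypothetical m-probabilities (agreement given a true match) and
# u-probabilities (agreement given a non-match) for non-identifying fields.
FIELDS = {
    # field: (m, u)
    "date_of_surgery": (0.95, 0.01),
    "hospital_code":   (0.98, 0.05),
    "tumour_site":     (0.90, 0.10),
    "sex":             (0.99, 0.50),
}

def match_weight(agreements: dict) -> float:
    """Sum log2 likelihood ratios: +log2(m/u) on agreement, +log2((1-m)/(1-u))
    on disagreement. Pairs scoring above a chosen threshold are accepted as links."""
    weight = 0.0
    for field, agrees in agreements.items():
        m, u = FIELDS[field]
        weight += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return weight

pair = {"date_of_surgery": True, "hospital_code": True, "tumour_site": True, "sex": False}
print(round(match_weight(pair), 2))
```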


Subject(s)
Data Management/methods , Data Management/statistics & numerical data , Datasets as Topic/standards , Electronic Health Records/statistics & numerical data , Electronic Health Records/standards , Intestinal Neoplasms/epidemiology , Medical Record Linkage/methods , Datasets as Topic/statistics & numerical data , Humans , Intestinal Neoplasms/mortality , Intestinal Neoplasms/surgery , Models, Statistical , Reproducibility of Results , State Medicine , United Kingdom
18.
Hum Pathol ; 114: 54-65, 2021 08.
Article in English | MEDLINE | ID: mdl-33992659

ABSTRACT

BACKGROUND AND OBJECTIVES: A standardized data set for esophageal carcinoma pathology reporting was developed based on the approach of the International Collaboration on Cancer Reporting (ICCR) for the purpose of improving cancer patient outcomes and international benchmarking in cancer management. MATERIALS AND METHODS: The ICCR convened a multidisciplinary international expert panel to identify the best evidence-based clinical and pathological parameters for inclusion in the data set for esophageal carcinoma. The data set incorporated the current edition of the World Health Organization Classification of Tumours of the Digestive System and the Tumour-Node-Metastasis staging system. RESULTS: The scope of the data set encompassed resection specimens of the esophagus and esophagogastric junction with tumor epicenter ≤20 mm into the proximal stomach. Core reporting elements included information on neoadjuvant therapy, operative procedure used, tumor focality, tumor site, tumor dimensions, distance of tumor to resection margins, histological tumor type, presence and type of dysplasia, tumor grade, extent of invasion in the esophagus, lymphovascular invasion, response to neoadjuvant therapy, status of resection margin, ancillary studies, lymph node status, distant metastases, and pathological staging. Additional non-core elements considered useful to report included clinical information, specimen dimensions, macroscopic appearance of tumor, and coexistent pathology. CONCLUSIONS: This is the first international peer-reviewed structured reporting data set for surgically resected specimens of the esophagus. The ICCR carcinoma of the esophagus data set is recommended for routine use globally and is a valuable tool to support standardized reporting and to benefit patient care by providing diagnostic and prognostic best-practice parameters.


Subject(s)
Carcinoma/surgery , Datasets as Topic/standards , Esophageal Neoplasms/surgery , Esophagectomy , Esophagogastric Junction/surgery , Research Design/standards , Stomach Neoplasms/surgery , Benchmarking/standards , Carcinoma/secondary , Chemoradiotherapy, Adjuvant , Cooperative Behavior , Data Accuracy , Esophageal Neoplasms/pathology , Esophagogastric Junction/pathology , Evidence-Based Medicine/standards , Humans , International Cooperation , Neoadjuvant Therapy , Neoplasm Grading , Neoplasm Staging , Stomach Neoplasms/pathology , Treatment Outcome
19.
Br J Cancer ; 125(2): 155-163, 2021 07.
Article in English | MEDLINE | ID: mdl-33850304

ABSTRACT

The complexity of neoplasia and its treatment are a challenge to the formulation of general criteria that are applicable across solid cancers. Determining the number of prior lines of therapy (LoT) is critically important for optimising future treatment, conducting medication audits, and assessing eligibility for clinical trial enrolment. Currently, however, no accepted set of criteria or definitions exists to enumerate LoT. In this article, we seek to open a dialogue to address this challenge by proposing a systematic and comprehensive framework to determine LoT uniformly across solid malignancies. First, key terms, including LoT and 'clinical progression of disease' are defined. Next, we clarify which therapies should be assigned a LoT, and why. Finally, we propose reporting LoT in a novel and standardised format as LoT N (CLoT + PLoT), where CLoT is the number of systemic anti-cancer therapies (SACT) administered with curative intent and/or in the early setting, PLoT is the number of SACT given with palliative intent and/or in the advanced setting, and N is the sum of CLoT and PLoT. As a next step, the cancer research community should develop and adopt standardised guidelines for enumerating LoT in a uniform manner.
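Since the proposed reporting format is essentially a small arithmetic convention, a tiny helper makes it concrete. The function name is ours, not the paper's; the counting rules follow the definitions given in the abstract.

```python
def format_lot(curative_lines: int, palliative_lines: int) -> str:
    """Render prior lines of therapy in the proposed 'LoT N (CLoT + PLoT)' format,
    where CLoT counts curative-intent/early-setting SACT, PLoT counts
    palliative-intent/advanced-setting SACT, and N is their sum.
    (Helper name is ours, not the paper's.)"""
    total = curative_lines + palliative_lines
    return f"LoT {total} ({curative_lines} + {palliative_lines})"

# A patient who received one curative-intent regimen and two palliative lines:
print(format_lot(1, 2))  # "LoT 3 (1 + 2)"
```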


Subject(s)
Clinical Decision-Making/methods , Neoplasms/therapy , Datasets as Topic/standards , Decision Support Systems, Clinical , Delphi Technique , Humans
20.
Neural Netw ; 139: 358-370, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33901772

ABSTRACT

As a major method for relation extraction, distantly supervised relation extraction (DSRE) suffers from the noisy-label problem and the class-imbalance problem (both of which are also common in many other NLP tasks, e.g., text classification). However, no existing research in DSRE or other NLP tasks appears to solve both problems simultaneously, which is a significant gap in related research. In this paper, we propose a loss function that is robust to noisy labels and efficient for class-imbalanced datasets. More specifically, we first quantify the negative impacts of the noisy-label and class-imbalance problems, and then construct a loss function that minimizes these negative impacts through a linear programming method. To the best of our knowledge, this is the first attempt to address the noisy-label problem and the class-imbalance problem simultaneously. We evaluated the constructed loss function on the distantly labeled dataset, our artificially noised dataset, the human-annotated DocRED dataset, and an artificially noised version of CoNLL 2003. Experimental results indicate that a DNN model adopting the constructed loss function can outperform models that adopt state-of-the-art noisy-label-robust or negative-sample-robust loss functions.
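To show what "robust to noisy labels and efficient for imbalanced classes" can look like in code, here is a generic combination of two standard ingredients: inverse-frequency class weights for imbalance and a bounded generalized cross-entropy term (Zhang & Sabuncu, 2018) for noise robustness. This is explicitly not the linear-programming-derived loss constructed in the paper; all names and values are illustrative.

```python
import torch

def weighted_gce_loss(logits, targets, class_counts, q: float = 0.7):
    """Generic illustration only: inverse-frequency class weights (for imbalance)
    combined with generalized cross-entropy (bounded, hence more robust to noisy
    labels). NOT the loss constructed in the paper via linear programming."""
    probs = torch.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
    gce = (1.0 - p_true.pow(q)) / q                      # bounded per-sample loss
    weights = (1.0 / class_counts.float())[targets]      # inverse class frequency
    weights = weights / weights.mean()                   # keep the loss scale stable
    return (weights * gce).mean()

logits = torch.randn(16, 3, requires_grad=True)
targets = torch.randint(0, 3, (16,))
class_counts = torch.tensor([100, 10, 1])                # imbalanced classes
loss = weighted_gce_loss(logits, targets, class_counts)
loss.backward()
print(float(loss))
```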


Subject(s)
Supervised Machine Learning , Datasets as Topic/standards , Signal-To-Noise Ratio